Conversation
That still leaves the partitions separate, so this is great if a user wants to load each partition separately, but it doesn't work when a user needs the model weights consolidated. Also, I don't think this PR should do this by default, since it adds an overhead that most users won't need, so it should be configurable. And, as suggested elsewhere, the model_states.pt file with fake weights probably shouldn't even be saved: it only confuses users who try to load it, and loading is guaranteed to fail.
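A minimal sketch of what such gating could look like, assuming a hypothetical config flag; neither the flag name nor the helper below exists in DeepSpeed:

```python
# Hypothetical sketch: gate the partitioned save behind a config flag.
# "save_zero3_partitioned_weights" is an assumed name, not a real DeepSpeed option.
def maybe_save_partitioned_weights(engine, state_dict):
    if getattr(engine._config, "save_zero3_partitioned_weights", False):
        # Only pay the extra traversal/serialization cost when requested.
        engine.save_partitioned_weights(state_dict)
```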
```python
def save_partitioned_weights(self, state_dict):
    # Replace each full parameter in the state_dict with this rank's
    # flattened ZeRO-3 shard.
    for name, param in self.module.named_parameters():
        if name in state_dict:
            state_dict[name] = param.ds_tensor
```
Found an issue here: param.ds_tensor at this point appears to be a flattened buffer, so state_dict ends up being populated with 1D vectors.
But we can't reshape it back to the original, since we only have a part of the tensor; doing something like narrow(0, 0, param.ds_numel).view(param.ds_shape) from _allgather_param() won't work, and the shape has no meaning here anyway.
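To make this concrete, here is a standalone PyTorch sketch (illustrative numbers, not DeepSpeed code) showing why one rank's shard can't be viewed back to the full shape:

```python
import torch

ds_shape, ds_numel = (4, 8), 32    # made-up full shape and element count
flat = torch.randn(ds_numel)       # the full parameter, flattened
partition = flat[: ds_numel // 2]  # one rank's shard: only 16 of 32 elements

try:
    partition.view(ds_shape)       # only valid on the full 32-element buffer
except RuntimeError as e:
    print(e)  # shape '[4, 8]' is invalid for input of size 16
```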
So this line of logic is useful when it's used to load param.ds_tensor directly on each GPU, as coded in the rest of this PR.
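As a rough illustration, per-rank loading could mirror save_partitioned_weights above; this is a hypothetical sketch, not the PR's actual code:

```python
def load_partitioned_weights(self, state_dict):
    # Hypothetical counterpart to save_partitioned_weights (assumed name,
    # not the PR's code): each rank copies its own flattened shard back
    # into param.ds_tensor in place.
    for name, param in self.module.named_parameters():
        if name in state_dict:
            param.ds_tensor.copy_(state_dict[name])
```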
I just tried to use it to get the partitioned fp16 weights, but now I understand this is not possible using this approach.
Bottom line: there is no problem here; I just needed to understand that what's being saved is not a real state_dict but something like a flattened_params_state_dict.
All is good!
Save ZeRO3 (partitioned) fp16 weights. This is a first step toward using ZeRO3 weights outside DeepSpeed, #872.